Skip to content

fix(core,cli): close tool_use↔tool_result invariant across all failure paths#4176

Merged
wenshao merged 20 commits into
QwenLM:mainfrom
wenshao:fix/persist-partial-tool-use-on-stream-error
May 21, 2026
Merged

fix(core,cli): close tool_use↔tool_result invariant across all failure paths#4176
wenshao merged 20 commits into
QwenLM:mainfrom
wenshao:fix/persist-partial-tool-use-on-stream-error

Conversation

@wenshao

@wenshao wenshao commented May 15, 2026

Copy link
Copy Markdown
Collaborator

Summary

Fixes an unrecoverable wedge on weak-network connections (e.g. on a train) when calling DeepSeek-V4-Pro (or any Anthropic-compatible backend) through the Anthropic protocol. Same wedge reproduces under three additional failure modes surfaced in review; all closed in this PR.

Symptom — the model 400s with:

unexpected messages.N.content.0: tool_use_id found in tool_result blocks:
call_00_xxx. Each tool_result block must have a corresponding tool_use
block in the previous message.

and pressing Ctrl+Y to retry yields the same error indefinitely. The chat session is permanently wedged.

Root cause — mid-stream tool_use drop

The Anthropic-compatible SSE emits tool_use content blocks as content_block_startinput_json_delta*content_block_stopmessage_deltamessage_stop. In anthropicContentGenerator.processStream, the functionCall chunk is yielded at content_block_stop.

When the SSE drops between content_block_stop of a tool_use and the terminal message_stop (typical for flaky train/airplane WiFi):

  1. The functionCall chunk has already been yielded → Turn.run registers a ToolCallRequest event → useGeminiStream schedules the tool after the for-await exits.
  2. But processStreamResponse's this.history.push({role:'model'…}) never runs — the for-await throws before reaching it.
  3. The tool finishes asynchronously → handleCompletedToolssubmitQuery(…, ToolResult)geminiChat.sendMessageStream pushes user:[functionResponse] into history.
  4. History is now […, user:"query", user:[tool_result]]. The converter emits two consecutive user messages where the second carries tool_result blocks but the preceding message has no matching tool_use → 400.
  5. stripOrphanedUserEntriesFromHistory only pops trailing user entries. The lost tool_use is unrecoverable, and the chat is wedged for the rest of the session.

Fixes

1. Persist partial assistant turn when stream errors mid tool_use

processStreamResponse (packages/core/src/core/geminiChat.ts) wraps its for-await in a try/catch. On stream error, if any functionCall chunk was already yielded (hasToolCall === true) and we accumulated parts, persist the partial assistant turn into history before re-throwing — so the eventual tool_result submission has its matching tool_use.

Plain-text partial turns (no functionCall) are intentionally not persisted: the Retry path pops the trailing user prompt and re-issues it; a stale partial-text model turn between them would either bias the retry or surface as duplicate output.

The shared processStreamResponse covers both Anthropic and OpenAI content generators, so the fix applies to all anthropic-compatible and openai-compatible providers — not just DeepSeek.

2. Close tool_use ↔ tool_result invariant at every failure point

Surfaced in @yiliang114's review: the partial-history push above relies on the React tool scheduler eventually producing a matching tool_result. Three races break that contract:

  • Race A — Ctrl+Y while the in-flight tool hasn't finished. Trailing entry is model[tool_use], not user, so stripOrphanedUserEntriesFromHistory doesn't pop it. Retry pushes a fresh user turn after the orphan tool_use. Meanwhile the scheduler's onAllToolCallsComplete is single-shot AND gated on isResponding (useGeminiStream.ts:1971), so the eventual real tool_result is silently swallowed.
  • Race B — process crash / OOM / SIGKILL between the partial-tool_use push and tool completion. The dangling model[tool_use] wedges the first API call after --resume.
  • Race C — manual JSONL edits / external tooling leaving the same dangling shape.

Three pieces working together close every race:

  1. repairOrphanedToolUseTurns(history) in geminiChat.ts — walks history left-to-right and synthesizes an error-typed functionResponse for every functionCall whose id isn't echoed back in the next user turn. Appends to an existing user turn when present, otherwise inserts a new one.
  2. Wiring at two points:
    • client.startChat() after loading the transcript — covers Race B/C (--resume).
    • chat.sendMessageStream immediately after this.history.push(userContent) — covers Race A in-session. The user-supplied turn gets the first chance to close the pair before the synthesizer sees it as dangling, so a Retry of a previous ToolResult submission (lastPrompt is a functionResponse-parts array) lands the real functionResponse instead of a synthetic error.
  3. History-based dedup in handleCompletedTools (useGeminiStream.ts) — before submitting tool results, scan chat.history for any functionResponse.id already present (planted by the repair pass) and drop those toolCalls' responseParts from the wire payload while still markToolsAsSubmitted so the UI/scheduler advances. The dedup runs before the isResponding early-return so the scheduler's single-shot onAllToolCallsComplete doesn't leave the tool stuck in completed-but-not-submitted when a Retry stream is in flight.

This is the qwen-code analogue of Claude Code's yieldMissingToolResultBlocks (query.ts:123-149), split across the core/cli boundary because our React scheduler runs out-of-band from the stream loop (upstream's in-band StreamingToolExecutor can atomically .discard() live tools at the synthesis point; we use history-dedup at handleCompletedTools instead to keep the in-flight scheduler's late real result from colliding with the synthesized error).

Test plan

  • npx vitest run packages/core/src/core/geminiChat.test.ts — 90 tests pass (3 partial-history-push cases + 8 repair-helper unit cases + 2 chat.sendMessageStream integration cases)
  • npx vitest run packages/core/src/core/client.test.ts — 131 tests pass (Retry path mocks updated for new method)
  • npx vitest run packages/cli/src/ui/hooks/useGeminiStream.test.tsx — 92 tests pass (Race A end-to-end dedup integration test)
  • npx vitest run --project '@qwen-code/qwen-code-core' — all 8186 core tests pass
  • tsc --noEmit -p packages/core/tsconfig.json clean
  • CI green on all four commits (Lint + Test×3 platforms + CodeQL)
  • Manual: throttle network, trigger a tool_use response, kill the connection mid-stream, confirm the next user/tool_result submission succeeds instead of 400ing. Ctrl+Y mid-flight and --resume of a session crashed mid-flight.

Related

Closes #4178

…_use

Weak-network failures during an Anthropic-compatible stream (DeepSeek,
api.anthropic.com, etc.) can drop the SSE between a tool_use
`content_block_stop` and the terminal `message_stop`. The functionCall
chunk is already yielded at content_block_stop, so:

  - Turn.run records a ToolCallRequest event.
  - useGeminiStream's for-await exits and schedules the tool.
  - handleCompletedTools eventually fires submitQuery(..., ToolResult)
    and pushes a user[functionResponse] into history.
  - But processStreamResponse's history.push for the model turn never
    ran (the for-await threw first), so the matching tool_use is gone.

The next request body has `user → user[tool_result]` with no tool_use
in between, and the server rejects with HTTP 400:
"tool_use_id ... must have a corresponding tool_use block in the
previous message". Ctrl+Y can't recover because
stripOrphanedUserEntriesFromHistory only strips trailing user
entries — the lost tool_use is unrecoverable, and the session is
wedged.

Wrap the for-await loop in processStreamResponse with try/catch.
When the stream throws AND any functionCall chunk was already
yielded (hasToolCall=true), persist the partial assistant turn to
history before re-throwing. The eventual tool_result submission
then has a matching tool_use and the session can continue.

Plain-text partial turns (no functionCall yielded) are intentionally
NOT persisted: the Retry path pops the trailing user prompt and
re-issues it, so a stale partial-text model turn between them would
either bias the retry or surface as duplicate output.
@github-actions

Copy link
Copy Markdown
Contributor

📋 Review Summary

This PR fixes a critical unrecoverable wedge that occurs when streaming connections drop mid-tool_use on weak networks. The fix ensures partial assistant turns containing tool calls are persisted to history before re-throwing stream errors, maintaining the tool_use/tool_result pairing invariant required by Anthropic-compatible APIs. The implementation is well-reasoned with comprehensive tests and excellent documentation.

🔍 General Feedback

  • Exceptional documentation: The code comments and PR description provide outstanding context about the root cause, including the specific Anthropic SSE emission pattern (content_block_startinput_json_delta*content_block_stopmessage_deltamessage_stop)
  • Targeted fix: The change is surgical—only persisting partial turns when hasToolCall === true, deliberately skipping text-only partials to avoid biasing retry behavior
  • Shared code path benefit: The fix in processStreamResponse automatically covers both Anthropic and OpenAI content generators
  • Test coverage: Two new test cases precisely validate both branches of the new error handling logic

🎯 Specific Feedback

🟡 High

  • File: packages/core/src/core/geminiChat.ts:1320 - The streamError capture pattern is sound, but consider whether the caught error should be logged before re-throwing for debugging purposes. Weak-network issues are notoriously hard to reproduce, and having structured logs when this recovery path triggers would aid future debugging.

    Suggested fix:

      } catch (e) {
        streamError = e;
        // Log for debugging weak-network recovery path
        if (hasToolCall) {
          debugLogger.log('Persisting partial assistant turn due to stream error');
        }
      }

🟢 Medium

  • File: packages/core/src/core/geminiChat.ts:1257-1268 - The comment explaining streamError is extremely detailed (which is good), but it's positioned before the variable is actually used. Consider moving this extensive explanation to the recovery branch at line 1383 where the actual decision logic lives, and keep a shorter note here.

    Suggested restructuring:

      let streamError: unknown = null;  // Captured mid-stream; see recovery logic below
    
      try {
        // ... loop unchanged ...
      } catch (e) {
        streamError = e;
      }
    
      // ... later, at the recovery branch ...
      /**
       * Mid-stream failure recovery: [full explanation currently at line 1257]
       */
      if (streamError !== null) { ... }
  • File: packages/core/src/core/geminiChat.test.ts:702-760 - The test "persists partial assistant turn when stream throws after a tool_use chunk" is thorough but could benefit from explicitly asserting that the history length is exactly 2 (user + model) after the error, making the test's intent even clearer to future readers.

    Suggested addition (after line 758):

        expect(history.length).toBe(2);  // user turn + model turn with tool_use

🔵 Low

  • File: packages/core/src/core/geminiChat.ts:1383 - The recovery comment block is ~20 lines. While the detail is valuable, consider extracting this into a private method like shouldPersistPartialTurn() with an inline JSDoc. This would make the main flow easier to scan and the recovery logic independently testable.

  • File: packages/core/src/core/geminiChat.test.ts - Consider adding a test case that simulates the full retry flow (Ctrl+Y scenario) to verify that the persisted tool_use allows the subsequent ToolResult submission to succeed rather than 400ing. This would be an integration-style test complementing the current unit tests.

✅ Highlights

  • Brilliant root cause analysis: The PR description's explanation of the Anthropic SSE emission pattern and how the drop between content_block_stop and message_stop causes the wedge is exemplary technical writing
  • Discriminating fix: The deliberate choice to persist tool_use partials but NOT text-only partials shows deep understanding of the retry path's behavior and avoids introducing duplicate output issues
  • Comprehensive test coverage: The two new test cases cover both branches of the new logic with realistic stream simulations
  • Cross-provider benefit: Fixing the shared processStreamResponse method means Anthropic, OpenAI, and DeepSeek all benefit without duplicate changes
  • Clear failure mode documentation: The comment explaining why stripOrphanedUserEntries can't recover from this scenario (it only strips trailing user entries, not resurrect lost tool_use blocks) is invaluable for future maintainers

…ry fix

Adds a third case to the partial-history persistence test suite for
reasoning-mode providers (DeepSeek thinking, Claude 4.6+ adaptive): when
the assistant turn streams a thinking block AND a tool_use before the
SSE drops, the partial push must keep the thinking part before the
functionCall so DeepSeek's `injectThinkingOnToolUseTurns` converter
pass sees an existing block on the replayed turn and does not pre-pend
a synthetic empty one (which would discard the model's original
reasoning text).

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a wedge state on weak networks where Anthropic-compatible streams (DeepSeek, Anthropic, etc.) drop after a tool_use content_block_stop but before message_stop. The yielded functionCall chunk causes downstream tool execution and a follow-up tool_result user turn, but no matching tool_use was ever pushed into history, leading to permanent 400s. The fix wraps the stream loop in processStreamResponse with try/catch and persists the partial assistant turn into history when a tool call has already been yielded before re-throwing.

Changes:

  • Capture mid-stream errors in processStreamResponse instead of letting them abort post-loop processing.
  • When hasToolCall is true on error, push the partial assistant turn (thinking + consolidated parts) into history before re-throwing, preserving tool_use/tool_result pairing.
  • Add three vitest cases covering tool-only, thinking+tool, and text-only mid-stream failures.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated no comments.

File Description
packages/core/src/core/geminiChat.ts Adds try/catch around stream iteration and a recovery branch that conditionally persists the partial model turn before re-throwing.
packages/core/src/core/geminiChat.test.ts Adds three new tests verifying partial-turn persistence with tool_use, with thinking+tool, and non-persistence for text-only failures.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread packages/core/src/core/geminiChat.ts Outdated

@yiliang114 yiliang114 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fix recovers the common SSE-drop case correctly. Left one non-blocking follow-up inline about closing the tool_use ↔ tool_result invariant at the failure point rather than handing off to the scheduler.

@wenshao

wenshao commented May 15, 2026

Copy link
Copy Markdown
Collaborator Author

For reviewers — this PR is the minimal, surgical fix. The broader weak-network resilience story has two follow-ups filed so they can be reviewed independently:

Together (#4176 + #4177 + #4178) qwen-code's behavior on weak networks would be comparable to upstream Claude Code, addressing the same class of bug as anthropics/claude-code#6836 (open meta-issue with ~155 duplicate reports, oncall label).

Extends the partial-history fix in fe35e37 to cover the residual race
paths surfaced in PR QwenLM#4176 review:

  - Race A: Ctrl+Y while in-flight tool hasn't finished. History is
    [user, model(tool_use)] — `stripOrphanedUserEntriesFromHistory` only
    pops trailing user entries, so the retry payload lands as a fresh
    user turn after the orphan tool_use and API rejects. Meanwhile the
    scheduler's `onAllToolCallsComplete` is single-shot and gated on
    `isResponding`, so the eventual tool_result is silently swallowed.
  - Race B: process crash / OOM / SIGKILL between the partial-tool_use
    push and the React scheduler's tool_result submission. On `--resume`
    the dangling model(tool_use) wedges the first API call.
  - Race C: external tooling / manual JSONL edits leaving the same
    dangling shape.

The fix has three pieces working together:

1. `repairOrphanedToolUseTurns(history)` in geminiChat.ts walks history
   left-to-right and synthesizes an `error`-typed functionResponse for
   every functionCall whose id is not echoed back in the next user
   turn. Appends to an existing user turn when present, otherwise
   inserts a new one. Returns the injected (callId, name) list.

2. `GeminiClient.repairOrphanedToolUseTurnsInHistory()` wraps the helper
   and is called from three points:
     - `startChat()` after loading the transcript (Race B/C, --resume).
     - `sendMessageStream` Retry branch after stripOrphans (Race A).
     - `sendMessageStream` UserQuery/Cron branch (defensive belt-and-
       suspenders for anything that slipped past 1 and 2).

3. `handleCompletedTools` in useGeminiStream.ts dedupes against
   chat.history before submitting tool_results — if a synthetic
   functionResponse for the same callId is already present (planted by
   the repair pass), the in-flight scheduler's late result is dropped
   and the call is `markToolsAsSubmitted` so the UI advances. Same
   trade-off upstream Claude Code's `StreamingToolExecutor.discard()`
   makes — late real results are dropped on the wire after synthesis,
   the model sees the synthetic error and can retry the tool if it
   still wants the result.

Together with the partial-history push from fe35e37, every tool_use
that ever streamed to the consumer is guaranteed to have a matching
tool_result on the wire — regardless of whether the stream errored,
the user retried mid-flight, the process crashed, or the session is
later resumed. This is the qwen-code analogue of upstream Claude Code's
`yieldMissingToolResultBlocks` (query.ts:123-149), but split across
the core/cli boundary because the React tool scheduler runs out-of-band
from the stream loop (so the synthesis path can't atomically discard
in-flight tools the way upstream's StreamingToolExecutor can; the
history-dedup at handleCompletedTools fills that gap instead).

Tests:
  - 8 new repair-helper tests in geminiChat.test.ts cover Race A,
    Race B, partial coverage of parallel tool_use, idempotence on
    already-paired history, no-op on tool-free history, caller-
    supplied reason text, multiple non-adjacent dangling rounds, and
    routes through the GeminiChat instance-method wrapper.
  - client.test.ts mocks updated for the new GeminiChat method.
  - All 88 geminiChat tests pass; 131 client tests pass; 91
    useGeminiStream tests pass.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread packages/core/src/core/geminiChat.ts
Comment thread packages/core/src/core/geminiChat.ts Outdated
Post-review audit of 3de3241 surfaced two real races the earlier fix
did not actually close:

  - Race A (the one yiliang114 originally flagged): handleCompletedTools
    is gated on `isResponding` (useGeminiStream:1971) BEFORE the dedup
    branch ran, so when the user Ctrl+Y'd mid-tool, the dedup never
    fired and the in-flight tool stayed permanently stuck in
    `completed-but-not-submitted` (the scheduler's
    `allToolCallsCompleteHandler` is single-shot). The dedup branch was
    structurally unreachable on the very race it was meant to defend
    against.

  - Race in the Retry path: the repair pass in
    `client.sendMessageStream` Retry branch ran BEFORE the chat-internal
    `push(userContent)`, so a Retry of a previous ToolResult submission
    (lastPrompt is a functionResponse part array) raced its own real
    `functionResponse` against the synthesized error one. The synthesis
    was winning and pre-empting the real result, producing two
    `functionResponse` entries with the same callId on the wire.

This commit relocates both pieces to where the contract actually holds:

1. Move the dedup branch in `handleCompletedTools` to BEFORE the
   `isResponding` early-return so `markToolsAsSubmitted` always runs
   for callIds already paired in history, unblocking the UI/scheduler
   even when a new stream is in flight. The submission-side `geminiTools
   = completedAndReadyToSubmitTools.filter(... && !inHistory ...)` then
   covers the no-double-submit case in one expression.

2. Move the repair call from `client.sendMessageStream` Retry branch
   into `chat.sendMessageStream` immediately AFTER the user-supplied
   turn is pushed. The user's own tool_result (when the Retry payload
   carries one) gets the first chance to close the pair before the
   synthesizer sees it as dangling. `client.startChat()` keeps a
   belt-and-suspenders pass at session-load time so any pre-send code
   reading `chat.history` sees a well-formed shape.

Adds three integration tests covering the new wiring:

  - geminiChat.test.ts: chat.sendMessageStream synthesizes a
    functionResponse when history carries a dangling tool_use AND the
    user-supplied content doesn't already close it (Race B/C plus
    Race A from the chat side).
  - geminiChat.test.ts: chat.sendMessageStream does NOT synthesize
    when the user-supplied content IS a matching functionResponse
    (the Retry-of-ToolResult race the previous wiring tripped on).
  - useGeminiStream.test.tsx: handleCompletedTools dedups a late real
    result whose callId already has a functionResponse in chat.history
    (Race A end-to-end: markToolsAsSubmitted fires but no
    submitQuery is dispatched).

Test summary: 90/90 geminiChat, 131/131 client, 92/92 useGeminiStream;
all 8186 core tests pass; tsc clean.

Known limitation logged for follow-up: a Retry of ToolResult whose
intervening stream itself partial-pushed a SECOND tool_use produces a
trailing user turn with both the stale re-pushed `fr_A` and the
synthesized `fr_B` for the second call — `fr_A` retry lands as an
orphan because `fc_A` was already paired earlier in history. Triggered
only when partial-push fires on the second stream AND user retries the
tool_result rather than the user prompt; extremely low frequency in
practice. Cleanest fix is in the retryLastPrompt layer (don't re-push
an already-paired tool_result), which is out of scope for this PR.

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 2 comments.

Comment thread packages/cli/src/ui/hooks/useGeminiStream.ts
Comment thread packages/core/src/core/geminiChat.ts Outdated
@wenshao wenshao changed the title fix(core): persist partial assistant turn when stream errors mid tool_use fix(core,cli): close tool_use↔tool_result invariant across all failure paths May 15, 2026
@wenshao wenshao requested a review from yiliang114 May 15, 2026 15:19

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Downgraded from Approve to Comment: self-PR.

Review Summary: 2 files, +270/-36. Fixes unrecoverable wedge when streaming drops between tool_use content_block_stop and message_stop. Core logic is correct; 606 tests pass; tsc + eslint clean.

No high-confidence Critical issues found. 5 low-confidence suggestions identified (not posted as inline comments — needs human review):

  1. AbortSignal mid-tool_use persists partial turn — catch 块未区分 abort 与网络错误,取消后 history 含孤立 tool_use。建议加 守卫。
  2. Zero logging on partial turn persistence — 静默修改 history,调试困难。建议加 。
  3. recordAssistantTurn 在错误路径无条件调用 — 失败流的部分数据被记录为正常 turn。建议移入 块。
  4. Missing thinking-only test — 推理模型场景(thought:true 无 functionCall)无回归覆盖。
  5. Duplicate history.push 代码 — 错误/成功路径各一份相同实现,建议提取方法。

— DeepSeek/deepseek-v4-pro via Qwen Code /review

@wenshao wenshao left a comment

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Downgraded from Request changes to Comment: self-PR.

Comment thread packages/cli/src/ui/hooks/useGeminiStream.test.tsx Outdated
Comment thread packages/core/src/core/geminiChat.ts Outdated
Comment thread packages/core/src/core/client.ts
Review surfaced four issues on top of b2fed61. All addressed here:

  - [P0] Synthetic functionResponse was appended to the trailing user
    turn's parts; on a Ctrl+Y race the shape became `user[text, fr]` and
    Anthropic-compatible backends still 400 because tool_result blocks
    must come FIRST in a user message (Claude Code does the same hoist
    in utils/messages.ts:hoistToolResults). Repair now slots synthetic
    fr in before the first non-functionResponse part, after any
    pre-existing real fr parts (which preserves real-tool-result order
    on a parallel-partial-submit). Tests asserting the part order added.

  - [P0] The dedup test fixture in useGeminiStream.test.tsx forced an
    incompatible object through a direct cast to TrackedCompletedToolCall;
    `tsc` rejected it with TS2352. Fixture restructured to match the
    type's actual shape (matches the existing `Mocked` patterns at
    L597-606 / L2396-2401) so the changed-file typecheck stays green.

  - [P1] On stream error, `recordAssistantTurn` ran unconditionally
    (gated only on parts being non-empty), so partial text-only turns
    we intentionally do NOT push into in-memory history were still
    written to the chat-recording JSONL — leaving --resume's
    transcript-load path to re-inject a model turn the in-session run
    discarded. Gate now matches the `history.push` decision: record
    iff we will persist (success path OR stream-error-with-tool_use).

  - [P2] Added client-level tests for the startChat repair wiring:
    extraHistory ending in `model[functionCall]` triggers synthetic fr
    injection on init; happy-path resume is a no-op. Defends against a
    future reorder/removal of the call in startChat() regressing the
    --resume recovery path.

All 91 geminiChat, 133 client, 92 useGeminiStream tests pass. Full
core suite (8186 tests) green. Changed-file tsc clean except for the
pre-existing `recordCompletedToolCall` typing gap that exists on main.
@wenshao

wenshao commented May 16, 2026

Copy link
Copy Markdown
Collaborator Author

Addressed all review feedback in fc4d57b6e. Four issues from the inline threads + self-review pass:

[P0] tool_result ordering in synthesized user turns (gpt-5.5 thread) — The repair pass was appending synthetic functionResponse to the trailing user turn's parts, so a Ctrl+Y race produced user[text, fr] and Anthropic-compatible backends still 400 ("tool_result must follow tool use"). Upstream Claude Code's utils/messages.ts:hoistToolResults confirms the requirement. Repair now slots synthetic fr before the first non-functionResponse part, and after any pre-existing real fr parts (preserves real-tool-result order on a parallel-partial-submit). Two test assertions added to lock the order.

[P0] TS2352 in the dedup test fixture — My lateRealResult was forcing an incompatible object through a direct cast to TrackedCompletedToolCall; vitest (esbuild) accepted it but tsc rejected with TS2352. Fixture restructured to match the type's actual shape (matches the existing Mocked patterns at L597-606 / L2396-2401). Changed-file tsc now clean for this file.

[P1] recordAssistantTurn divergence on stream error (Copilot thread) — Recording previously ran on every stream error (gated only on parts being non-empty), so text-only partials we intentionally don't push to in-memory history were still written to chat-recording JSONL — --resume's transcript-load path would then re-inject a model turn the in-session run discarded. Recording gate now matches the history.push decision: record iff we will persist (success path OR streamError + hasToolCall + parts). The "don't persist plain-text partials" intent is now consistent across in-memory and on-disk state.

[P2] startChat resume integration test (gpt-5.5 thread) — Added two client.test.ts cases for the resume-time repair wiring: (a) extraHistory ending in model[functionCall] triggers synthetic fr injection on init; (b) happy-path resume is a no-op. Defends against a future reorder/removal of the repairOrphanedToolUseTurnsInHistory() call in startChat() regressing the --resume recovery path.

Test summary: 91/91 geminiChat, 133/133 client, 92/92 useGeminiStream; full core suite (8186 tests) green; changed-file tsc clean.

Untouched (acknowledged but out-of-scope):

  • AbortSignal vs network-error in the partial-push branch (self-review pre-release: fix ci #1) — currently both paths trigger the partial push. Real user-abort path in useGeminiStream.handleUserCancelledEvent runs addHistory({ role: 'user', parts: combinedParts }) with cancellation responseParts for every in-flight tool, so the dangling model[fc] gets paired by the user-cancel path's own machinery; the partial-push isn't actually orphaning anything. If we ever want a cleaner separation, the if (signal?.aborted) guard belongs around the partial-push itself, not the synthesis (which already handles the abort + non-cancellation case correctly).
  • Refactor the two history.push({role: 'model', parts: [...]}) call sites into a single helper (self-review TypeError in Authentication Selection Interface #5) — both call sites pass exactly the same expression but diverge on when they're reached. Extracting now would obscure the success-vs-recovery distinction the comment block above each push explains; happy to revisit if a future change adds a third call site.

Comment thread packages/cli/src/ui/hooks/useGeminiStream.ts Outdated
…wenLM#4176)

One Suggestion thread from the deepseek-v4-pro pass on commit
c30bba6.

`MockedGeminiClientClass` did not expose `getHistoryFunctionResponseIds`,
so 3 of the 4 dedup tests in `useGeminiStream.test.tsx` fell through
to the `else if (typeof geminiClient.getHistory === 'function')`
branch — the `structuredClone(this.history)` slow path. Only the
mixed-batch test added in c30bba6 wired the fast path explicitly.
Net effect: a regression in `getHistoryFunctionResponseIds` (wrong
ids, missing ids, exception) would silently re-route every dedup
batch onto the slow clone path with all 4 tests still green, while
production paid the multi-millisecond UI-thread stall this PR
specifically added the accessor to avoid.

Fix in two parts:

1. Add `getHistoryFunctionResponseIds = vi.fn().mockReturnValue(new
   Set<string>())` to the `MockedGeminiClientClass` default. Every
   test now exposes the fast-path accessor, so the dispatcher takes
   the `getHistoryFunctionResponseIds` branch by default — matching
   production. Tests that need a non-empty dedup set override the
   mock explicitly.

2. Override the mock in each of the three previously-affected dedup
   tests with a `Set` containing the callId(s) their `getHistory()`
   fixture had paired. The dedup assertions stay identical — but
   the path the production code takes to reach them now matches
   the path under test.

Tests: 96/96 useGeminiStream (no new tests; existing dedup tests
now exercise the fast path the previous commit added). 353/353
across core + cli. tsc + eslint + prettier clean.

@yiliang114 yiliang114 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed the 8 commits since my last pass (2dbfc4e06a6951). The deferred-flush mechanism directly addresses the chat-recording inconsistency I flagged — design looks correct and the lifecycle pairing (set → pop on retry-success / flush on unretryable break) is clean. Hoist + duplicate-drop in the repair pass close real wedge paths I hadn't considered. A few observations:


P1 — repairOrphanedToolUseTurns complexity

The function grew from ~50 to ~200 lines with scan + hoist + dedup + splice + empty-turn cleanup all interleaved in one forward-walk loop. The allRemovalTargets sort-by-desc-then-splice approach is correct today, but the interleaving of scan and mutation phases makes it brittle for future additions. If another fix-up category lands here (e.g. multi-model parallel tool dispatch), I'd expect index-drift bugs.

Suggestion for a follow-up: refactor into scan phase (collect locations) → decision phase (classify synthesize/hoist/drop) → mutation phase (apply all changes). Not blocking, but flagging because this is the most complex single function in the repair subsystem now.

P1 — finally block swallows recording errors silently

try {
  self.chatRecordingService?.recordAssistantTurn(...);
} catch (recordErr) {
  debugLogger.warn('[PARTIAL_FLUSH] Failed to persist ...');
}

If JSONL writes fail persistently (disk full, permission), every deferred partial is silently lost. The "eventual consistency" trade-off is fine for transient failures, but a sustained failure deserves an error-level log or an increment of a failure counter so monitoring can catch it.


P2 — Minor suggestions (non-blocking)

  1. addHistory warn log: the comment says "today this can't happen" — if it's a true invariant violation, debugLogger.error with "invariant violation" framing would be more appropriate than warn (which is easy to ignore in noisy logs).

  2. Three-branch duck-type fallback in useGeminiStream: the getHistory() fallback exists for test mocks, but all mocks now implement getHistoryFunctionResponseIds. The middle branch is dead code in production. Consider removing it and enforcing the interface contract in tests.

  3. PR-thread references in code comments: phrases like "gpt-5.5 review thread on PR #4176" and "qwen-latest-series-invite-beta-v34 thread on PR #4176" are useful during review but become noise after merge. Worth a cleanup pass before landing — keep the "why" explanation, drop the reviewer/thread citations.

  4. Test coverage gap: the original should retry on TPM throttling StreamContentError with initial delay and should use Retry-After delay for streamed rate-limit errors tests were fully replaced. The new tests cover the partial-rollback angle well, but the original assertions on delay timing and Retry-After header parsing don't have direct equivalents. If those behaviors are covered elsewhere, fine — otherwise it's a small regression in coverage.


Overall: the core fix is solid and the test matrix is thorough. The main concern is long-term maintainability of the repair function — everything else is polish.

…#4176)

Five items from the yiliang114 review on commit 06a6951. Two P1
(observability + structural maintainability) and three editorial.

[A] P1 — repairOrphanedToolUseTurns refactored into three pure
phases: scanModelTurn / planRepair / applyRepair. Each runs in
isolation against narrow `ScanResult` / `RepairPlan` types, and only
the last mutates `history`. The outer forward-walk loop is now a
~25-line orchestrator: scan → bail on empty expected → plan → bail
on empty plan → apply → record `injected` + `droppedDuplicates` →
advance cursor past any freshly-inserted user turn. Index drift can
only happen in applyRepair, so an audit narrows to one function.
Behavior identical to the pre-refactor monolith — all 13 existing
repair tests pass unchanged.

[B] P1 — finally-block JSONL flush error log promoted from
`debugLogger.warn` to `debugLogger.error`. A persistent write
failure (disk full, permission denied, serialization broken)
silently loses every deferred partial after the first occurrence;
that's the class of failure that warrants monitoring attention, not
the warn category that fades into log noise. Single-occurrence
transient failures still surface as one error per occurrence.

[C] P2 — addHistory partial-marker log promoted to `debugLogger.error`
with `[INVARIANT_VIOLATION]` tag. The existing call graph cannot
legitimately hit this branch (it only fires when addHistory runs
mid-sendMessageStream, which no current caller does); any future
hit is a true bug in a new caller, not noise.

[D] P2 — removed the `getHistory()` fallback branch in
`useGeminiStream.handleCompletedTools`. Verified via grep: every
GeminiClient consumer (real production code + MockedGeminiClientClass
+ every individual dedup test) now exposes
`getHistoryFunctionResponseIds`, so the three-branch duck-type
dispatch was dead-code at production level and only existed to
support an interim cycle when the fast-path accessor was being
introduced. The dispatcher is now a one-liner; production path is
strictly the fast accessor, with an empty Set returned if `geminiClient`
itself is missing (only happens in hook-level unit tests with no
client at all).

[E] P2 — cleanup pass dropping reviewer/thread citations from code
comments (e.g. "qwen-latest-series-invite-beta-v34 thread on PR
QwenLM#4176", "gpt-5.5 review thread", "deepseek-v4-pro thread",
"yiliang114 repro"). The "why" explanations stay — those are the
load-bearing context. Reviewer/thread attribution lives in the
PR history; embedding it in long-term comments was noise that
ages poorly. Touched 9 sites in geminiChat.ts, 2 in client.ts,
1 in useGeminiStream.ts, 4 in useGeminiStream.test.tsx, and 8 in
geminiChat.test.ts (test names + comment blocks).

Also: P2.4 (yiliang114 flagged a potential test-coverage regression
on TPM throttling + Retry-After delay) verified as a false alarm
— `should retry on TPM throttling StreamContentError with initial
delay` (geminiChat.test.ts:2959) and `should use Retry-After delay
for streamed rate-limit errors` (line 3035) both still exist.

Tests: 353/353 (no new tests; all existing pass through the
refactored repair function and the simplified dispatcher). tsc +
eslint + prettier clean.
@wenshao wenshao requested a review from yiliang114 May 20, 2026 17:14
yiliang114
yiliang114 previously approved these changes May 21, 2026

@yiliang114 yiliang114 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. The three-phase refactor is clean, error-level promotions look right, and the dead fallback branch is gone. Ship it.

@pomelo-nwu

Copy link
Copy Markdown
Collaborator

Comment density in packages/core/src/core/geminiChat.ts — non-blocking, readability/maintainability.

The fix itself looks sound; this note is only about comment volume. Roughly 60% of the +918 lines in geminiChat.ts are comments, with many 20–40 line prose blocks, and several of them document hypothetical futures or code the diff itself labels as no-op / unreachable. A few examples:

  • the popPartialIfPushed() call in the CompressionStatus.COMPRESSED branch — the comment itself calls it a "defense-in-depth no-op", then spends ~12 lines justifying keeping it;
  • the recovery-path pop before the OUTPUT_RECOVERY_MESSAGE handling — a ~38-line block;
  • addHistory — ~30 lines describing a future caller that "cannot legitimately hit this branch";
  • the tool_use_id ... must have a corresponding tool_use block wedge is re-explained nearly in full in 5+ separate places.

Why I'd trim it:

  • comments drift out of sync as the code evolves, and the long speculative ones rot fastest;
  • signal-to-noise — the load-bearing invariant gets buried, which makes future review and maintenance harder rather than easier.

To be clear, I'm not arguing against documenting the invariants: the bug is subtle and already caused a production wedge, so capturing the tool_use↔tool_result contract is well justified. It's the volume and the speculative future-proofing prose I'd push back on.

Suggestion (fine as a follow-up, shouldn't gate the merge):

  • aim to cut roughly in half — keep a 1–2 line pointer at each site;
  • consolidate the race-class analysis and the "why this no-op is kept" rationale into a single design note (one section-level block comment, or the PR description / a design doc) and link back to it;
  • drop or one-line the branches the code already labels "currently unreachable" / "defense-in-depth no-op";
  • state each invariant once in a canonical place instead of re-deriving the 400 explanation at every call site.
中文

packages/core/src/core/geminiChat.ts 注释体量过大 —— 非阻塞,仅可读性 / 可维护性。

修复本身没问题,这条只针对注释体量。geminiChat.ts 新增的 918 行里约 60% 是注释,多处是 20–40 行的大段文字,其中几处在长篇描述假想的未来场景、或 diff 自己都标注为 no-op / 不可达的代码。几个例子:

  • CompressionStatus.COMPRESSED 分支里的 popPartialIfPushed() —— 注释自己写明这是 "defense-in-depth no-op",却又花了约 12 行解释为什么保留;
  • OUTPUT_RECOVERY_MESSAGE 处理前的 recovery-path pop —— 约 38 行;
  • addHistory —— 约 30 行描述一个"当前调用图不可能触发"的未来调用者;
  • tool_use_id ... must have a corresponding tool_use block 这个 wedge,在 5 处以上几乎被完整重复解释了一遍。

为什么建议精简:

  • 注释会随代码演进与实现脱节,越是长篇的"假想未来"越容易腐烂;
  • 信噪比 —— 真正承重的不变量被淹没,反而让后续 review 和维护更难。

需要说明:我并不反对记录这些不变量 —— 这个 bug 很微妙、且已造成过线上死锁,把 tool_use↔tool_result 的契约写下来完全合理。我想商榷的是注释的体量,以及那些防御性的"未来可能"长篇推演。

建议(作为后续提交即可,不应阻塞合并):

  • 目标砍掉约一半 —— 每处保留 1–2 行点睛即可;
  • 把 race 分类分析、以及"为什么保留这个 no-op"的理由,集中到一处设计说明(一个 section 级块注释,或 PR 描述 / 设计文档),其余地方链接过去;
  • 对代码已标注 "currently unreachable" / "defense-in-depth no-op" 的分支,直接删除或压成一行;
  • 每个不变量只在一处权威位置陈述一次,不要在每个调用点都把 400 的成因重新推导一遍。

wenshao added 2 commits May 21, 2026 11:19
…ool-use-on-stream-error

# Conflicts:
#	packages/core/src/core/geminiChat.ts
…ign block (PR QwenLM#4176)

Addresses the pomelo-nwu review observation on c30bba6:
~60% of the +918 lines added to geminiChat.ts were comments, with
several 20–40 line prose blocks repeating the same race-class /
tool_use_id wedge analysis at every call site. Net –139 / +117
(net delete ~22 lines but mostly redistributing weight from per-site
blocks to a single canonical note).

Changes:

* Add one canonical design block above ORPHAN_TOOL_USE_REPAIR_REASON
  covering: the tool_use_id 400 wedge, the three race classes (A
  Ctrl+Y mid-flight, B process crash, C SSE drop), the two-layer fix
  (partial-push + repair), and the partial-push marker lifecycle.
  Every per-site comment that used to repeat this now points back.

* Trim popPartialIfPushed COMPRESSED-branch comment from 14 lines to
  4 (no-op-today + reason for keeping).

* Trim recovery-catch comment block from ~38 lines to ~7 (pop-order
  matters + index-checked, pointer to canonical note).

* Trim addHistory invariant comment from ~22 lines to ~5 (when this
  fires + why error-level, pointer to canonical note).

* Trim processStreamResponse partial-push comment from ~22 lines to
  ~8 (Race C name + signal + plain-text-not-persisted rule).

* Replace tool_use_id wedge re-explanations at 5 sites (repair doc,
  scanModelTurn doc, applyRepair doc, popPartialIfPushed comment,
  processStreamResponse comment) with single-line pointers. Canonical
  block at line 411 is now the only authoritative copy.

Reviewer marked this as non-blocking ("shouldn't gate the merge");
done as cleanup polish while LGTM is in.

Tests: 375/375 (no test changes). tsc + eslint + prettier clean.
@wenshao

wenshao commented May 21, 2026

Copy link
Copy Markdown
Collaborator Author

@pomelo-nwu thanks for the careful read — the cleanup landed in 298588190. Took your suggestions point-by-point:

  • Single canonical design note: one block above ORPHAN_TOOL_USE_REPAIR_REASON now covers the tool_use_id 400 wedge, the three race classes (A Ctrl+Y mid-flight, B process crash, C SSE drop), the two-layer fix, and the partial-push marker lifecycle. Every per-site comment that used to repeat this is now a one-line pointer.
  • One-lined the unreachable / no-op branches: the popPartialIfPushed COMPRESSED-branch comment went from 14 lines to 4 (no-op today — kept for uniformity in case a future in-place tryCompress stops resetting the marker). The InvalidStreamError content-retry "no reachable test path until the predicates diverge" note stays short.
  • Trimmed the speculative / future-caller prose: recovery-catch block ~38 → 7, addHistory invariant ~22 → 5, processStreamResponse partial-push ~22 → 8. The wedge text itself was re-explained nearly in full at 5 sites — only the canonical block has it now.

Net –139 / +117 in geminiChat.ts (so about the cut-in-half you suggested). Behavior unchanged; 375/375 still pass.

The invariants you flagged as load-bearing (tool_use↔tool_result contract, marker lifecycle, hoist before tail-append) all kept their "why" — just lifted out of the per-site duplicates and into one canonical place so future drift hits one site instead of five.

中文

@pomelo-nwu 感谢细读 —— 清理已经在 298588190 落地,按你的建议逐条处理:

  • 单一权威设计说明:在 ORPHAN_TOOL_USE_REPAIR_REASON 之上加了一个块注释,统一记录 tool_use_id 400 wedge、三个 race 类别(A Ctrl+Y 中途、B 进程崩溃、C SSE 断流)、两层修复架构、以及 partial-push marker 的生命周期。所有原本在各调用点重复的解释都改成了一行指针。
  • 不可达 / no-op 分支压成一行popPartialIfPushed 的 COMPRESSED 分支注释从 14 行砍到 4 行("今天是 no-op,保留是为了将来 in-place tryCompress 不再重置 marker 的情况")。InvalidStreamError content-retry 那段 "predicates diverge 前没有可达测试路径" 的说明也保持极简。
  • 删掉防御性 / 未来调用者的长篇推演:recovery catch ~38 → 7、addHistory invariant ~22 → 5、processStreamResponse partial-push ~22 → 8。wedge 这段文字原本在 5 处几乎被完整重写,现在只在 canonical block 出现一次。

geminiChat.ts 净变化 –139 / +117(大致就是你建议的"砍掉一半"的量)。行为不变;375/375 测试仍然通过。

你标记为承重的几个不变量(tool_use↔tool_result 契约、marker 生命周期、hoist 必须在 head 而非 tail)都保留了 "why",只是从各处重复迁移到一处权威位置 —— 以后再漂移只会集中在一个地方而不是五处。

@wenshao

wenshao commented May 21, 2026

Copy link
Copy Markdown
Collaborator Author

🧪 Local full-suite test report @ `298588190`

Ran `npx vitest run packages/core packages/cli` locally on the latest PR head to back up the green CI check.

Overall

Test Files 712 passed / 1 failed / 713 total
Tests 15,851 passed / 1 failed / 12 skipped (15,864 total)
Duration 286s (~4.8 min)
TSC core + cli ✅ clean (`--noEmit`)
ESLint changed files ✅ clean
Prettier changed files ✅ clean

PR-touched suites — all green

File Tests Time
`packages/core/src/core/geminiChat.test.ts` 123 / 123 6.29s
`packages/cli/src/ui/hooks/useGeminiStream.test.tsx` 101 / 101 0.62s
`packages/core/src/core/client.test.ts` 151 / 151 1.65s

Covers everything this PR added: partial-push marker invariants × 6, duplicate-fr drop × 2, recovery-path pop ordering, escalation flush, transient-retry rollback, `getHistoryFunctionResponseIds` unit tests × 6, mixed-batch dedup fast-path assertion.

The 1 failure — pre-existing on `main`, unrelated to this PR

```
FAIL packages/core/src/core/anthropicContentGenerator/anthropicContentGenerator.test.ts

treats unset baseURL as Anthropic-native (SDK default targets api.anthropic.com)

expected 'claude-cli/1.2.3 (external, cli)' to contain 'QwenCode/1.2.3'
at anthropicContentGenerator.test.ts:262:35
```

Verification: temporarily `git checkout origin/main -- packages/core/src/core/anthropicContentGenerator/` and re-ran that single case → still fails on main alone. This PR doesn't touch any `anthropicContentGenerator/*` files. The test was authored 2026-05-11 (`bdd5b602de`), and the regression appears to stem from `4b25f9c05` (`fix(core): set x-api-key alongside Authorization on Anthropic outbound (#4323)`) — expected the build-time-injected `QwenCode/1.2.3` but receives the hard-coded `claude-cli/...` UA.

→ Worth a separate follow-up issue for the Anthropic content generator maintainer; out of scope for #4176.

Conclusion

Zero regressions from this PR across the full `packages/core` + `packages/cli` matrix (15,863/15,863 passing once you exclude the pre-existing unrelated fail). Safe to merge.

中文

🧪 本地全量测试报告 @ `298588190`

为 CI 绿灯背书,本地跑了 `npx vitest run packages/core packages/cli`。

整体15,851 / 15,864 通过(1 fail / 12 skipped),耗时 286s。TSC core+cli、ESLint、Prettier(已改动文件)全部 clean。

PR 涉及的三个测试文件全绿:geminiChat.test.ts (123/123)、useGeminiStream.test.tsx (101/101)、client.test.ts (151/151) —— 覆盖了本 PR 新增的所有回归测试。

那 1 个失败与本 PR 无关,main 上已存在:`anthropicContentGenerator.test.ts:262` — 期望 `QwenCode/1.2.3` 但实际拿到 `claude-cli/...`。验证手段:把 `anthropicContentGenerator/` 临时 checkout 到 `origin/main` 后单独跑该 case 依然 fail。本 PR 没动该目录任何文件。可能源于 `#4323` 的 Anthropic 出站 header 改造。建议另起 follow-up issue。

结论:本 PR 在完整测试矩阵上零回归,可以放心 merge。

yiliang114
yiliang114 previously approved these changes May 21, 2026

@yiliang114 yiliang114 left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. All 38 review threads resolved, CI green on Linux/macOS, test coverage comprehensive (353+ tests). The three-phase repair refactor is clean and the partial-push lifecycle is well-guarded. Ship it.

…nLM#4176)

Second cleanup pass on pomelo-nwu's feedback. Previous commit
(2985881) cut net –22 lines but several blocks were still
20–25 lines of prose; this pass takes the rest down to the
one-or-two-line pointer pattern the reviewer suggested.

geminiChat.ts: 2384 → 2207 lines (-177). Comment density 37% → 32%.

Trimmed:

  * pendingPartialAssistantTurnIndex + pendingPartialAssistantRecord
    field doc blocks (16 + 22 lines → one combined 5-line block
    pointing at the canonical note).
  * clearPendingPartialState method doc (10 → 4 lines).
  * sendMessageStream inline-repair comment block (29 → 7 lines).
  * `finally` JSONL flush block (32 → 8 lines).
  * repairOrphanedToolUseTurns function doc (39 → 14 lines).
  * scanModelTurn / planRepair / applyRepair phase docs (11–22 → 4–9
    lines each).
  * RepairPlan interface doc (14 → 1 line).
  * GeminiChat.repairOrphanedToolUseTurns wrapper doc (9 → 3 lines).
  * GeminiChat.getHistoryFunctionResponseIds doc (13 → 6 lines).

What stayed (intentional load-bearing context):

  * The canonical block above ORPHAN_TOOL_USE_REPAIR_REASON — that's
    the consolidation point every per-site one-liner points back to.
  * Pre-existing upstream JSDoc (ctor, sendMessageStream, getHistory,
    redactStructuredOutputArgsForRecording).
  * The Reactive compression no-op note (already at the 4-line cap).
  * The recovery-catch 7-line block (already at the cap).
  * The addHistory invariant 5-line block (already at the cap).

Tests: 375/375 pass on the affected suites. tsc + eslint + prettier
clean on changed files. Behavior unchanged.
@wenshao

wenshao commented May 21, 2026

Copy link
Copy Markdown
Collaborator Author

Second pass on the comment-density feedback — e2033ec87. The first commit (298588190) consolidated the wedge analysis into one canonical block but left several 20–25 line per-site blocks; this one takes them down to the 1–2 line pointer pattern.

geminiChat.ts: 2384 → 2207 lines (-177). Comment density 37% → 32%.

Trimmed (one-or-two-line pointer + canonical note reference):

  • partial-push field doc blocks (16+22 → 5 lines combined)
  • clearPendingPartialState (10 → 4)
  • sendMessageStream inline-repair block (29 → 7)
  • finally JSONL flush block (32 → 8)
  • repairOrphanedToolUseTurns function doc (39 → 14)
  • scan/decision/mutation phase docs (11–22 → 4–9 each)
  • RepairPlan interface (14 → 1)
  • GeminiChat.repairOrphanedToolUseTurns wrapper (9 → 3)
  • getHistoryFunctionResponseIds (13 → 6)

What's left untouched is intentional: the canonical note (the consolidation point), pre-existing upstream JSDoc (ctor / `sendMessageStream` / `getHistory`), and the three blocks already at the cap (Reactive compression no-op = 4 lines, recovery catch = 7, addHistory invariant = 5).

375/375 still pass; tsc + eslint + prettier clean.

中文

针对注释体量的第二轮清理 —— `e2033ec87`。第一次清理(`298588190`)把 wedge 分析集中到了一个 canonical block,但确实还留了几处 20–25 行的 per-site 大块没动;这次全部压成 1–2 行 pointer 的模式。

`geminiChat.ts`:2384 → 2207 行 (-177),注释密度 37% → 32%

压缩明细(每处只留 1–2 行 + 指向 canonical note):

  • partial-push 两个 field doc 块(16 + 22 行 → 合并 5 行)
  • `clearPendingPartialState`(10 → 4)
  • `sendMessageStream` 内联 repair 块(29 → 7)
  • `finally` JSONL flush 块(32 → 8)
  • `repairOrphanedToolUseTurns` 函数 doc(39 → 14)
  • scan/decision/mutation 三个阶段 doc(11–22 → 4–9 各自)
  • `RepairPlan` interface(14 → 1)
  • `GeminiChat.repairOrphanedToolUseTurns` wrapper(9 → 3)
  • `getHistoryFunctionResponseIds`(13 → 6)

保留不动的几处是有意的:canonical note(集中说明的唯一权威位置)、上游 js-genai 原本就有的 JSDoc(ctor / `sendMessageStream` / `getHistory`)、以及已经压到上限的三处(Reactive compression no-op = 4 行、recovery catch = 7、addHistory 不变量 = 5)。

375/375 仍然通过,tsc + eslint + prettier 全绿。

@BZ-D

BZ-D commented May 21, 2026

Copy link
Copy Markdown
Collaborator

Following this closely — really appreciate the depth of the root-cause writeup, and especially the explicit contrast with upstream's in-band StreamingToolExecutor (atomic .discard() at the synthesis point vs. our out-of-band React scheduler workaround). That framing made the underlying architectural gap unusually clear.

Filed #4387 as an RFC to track the longer-term direction: moving tool dispatch from end-of-turn batched to stream-driven in-band, building on top of the invariant guarantees this PR establishes. Not asking for any action on this PR — the RFC is explicit that it builds on #4176 and intentionally avoids proposing anything that would conflict with the work in flight here. I'll revise the design notes once this lands.

Happy to discuss in #4387 or here, whichever you prefer.

@BZ-D BZ-D left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review

Solid fix for a genuinely tricky wedge. The 3-layer defense (partial push → repair pass → handleCompletedTools dedup) is well-chosen, and splitting repairOrphanedToolUseTurns into SCAN / DECISION / MUTATION phases makes it auditable. Test coverage at every failure point is impressive — particularly the regression guards for non-obvious invariants (dedup-before-isResponding, partial-push marker resets across all 6 history-mutation sites, escalation flush on finally).

A few non-blocking items worth considering:

1. Per-send repair pass cost — geminiChat.ts sendMessageStream

repairOrphanedToolUseTurns(this.history) runs on every send, walking the full history. For typical sessions this is microseconds, but on a 1000+ turn session with parallel tool_uses it could add up. Consider whether a "dirty bit" (e.g., set on stream error / setHistory / --resume load) is worth gating the pass on. Probably overkill given current usage, but worth a thought.

2. [PARTIAL_PUSH] log volume — geminiChat.ts partial-push branch

This warn fires on EVERY mid-stream tool_use error. On flaky networks (the exact use case this PR targets), it could spam logs. The other [REPAIR] warns are conditional on actual synthesis/drop happening, so they self-rate-limit; this one doesn't. Either rate-limiting or downgrading to info would help diagnostics stay signal-rich.

3. getHistoryFunctionResponseIds on public GeminiClient surface — client.ts:368

The method exists purely for the React hook's perf hot path. The JSDoc is clear about why, but it's a slight leak — the dispatcher contract (historyCallIdsWithResponse) is now spread across the core/cli boundary. A future refactor might consider wrapping it inside something like client.filterToolsForSubmission(...) so the cli side doesn't need to know about the dedup key shape. Not blocking.

4. Dead-code branch documented as "currently unreachable" — geminiChat.ts isContentError branch

The preserved popPartialIfPushed() call below the transient-stream retry. The comment is admirably honest and the rationale (defense-in-depth for a future error class that diverges the predicates) is sound. Some reviewers prefer to remove dead branches entirely and add them back when needed; others (myself included) think keeping the wiring with the WHY documented is fine. Your call.

5. Minor: comment density in tests

The WHY-style comments are appropriate for these subtle invariants and I'd rather have them than not, but a few blocks in the test file exceed the code they describe by 3–5×. Not asking to trim — just flagging that future readers may find the test file heavy to navigate.

Worth pinning explicitly

  • The dedup MUST run before the isResponding early-return — there's a dedicated regression test for this (runs Race A dedup BEFORE the isResponding early-return) which is exactly the right safeguard. Anyone refactoring handleCompletedTools later: do not move that block.
  • Every history-mutation method clearing the partial-push markers in lockstep is a load-bearing invariant. The partial-push marker invariants on history mutation describe block covers all 6 sites — keep it that way if new mutation methods are added.

LGTM with the optional considerations above. No request-changes.

@BZ-D BZ-D left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. 之前的 review 评论里有几条非阻塞建议可以后续考虑,但不影响合并。

@pomelo-nwu

Copy link
Copy Markdown
Collaborator

Real-environment verification (weak-network drop)

I ran an end-to-end weak-network reproduction against a real backend to check this PR's behavior, and want to share the data + an honest scope caveat.

Setup

Real CLI in tmux (dev mode, running the TS source) → a transparent SSE proxy that injects a mid-stream drop → dashscope deepseek-v4-pro (OpenAI-protocol endpoint). The proxy forwards the real upstream stream but destroy()s the socket right after the read_file tool_call's finish_reason:"tool_calls" + usage chunk, just before [DONE] — the OpenAI-path analogue of the Race-C drop point. Isolated HOME/cwd, YOLO mode. Prompt asks the model to read a file containing a secret code.

Result — main and PR behave identically; neither wedges

step main this PR
req #1 — drop after tool_calls finish API Error: terminated API Error: terminated
req #2 — Ctrl+Y retry (messages clean, no orphan tool_call) 200 200
req #3 — tool_use / tool_result correctly paired 200 → secret returned 200 → secret returned

Both versions recover cleanly on Ctrl+Y. Crucially, the retry request (#2) carries no dangling assistant[tool_calls] in either version — the wedge signature is absent on both. Proxy request-log (identical across main and PR):

REQUEST #1  system | user | assistant | user                                             -> 200, INJECTED DROP
REQUEST #2  system | user | assistant | user                                             -> 200  (Ctrl+Y retry, clean)
REQUEST #3  ... | assistant[tool_calls:read_file] | tool[tool_result:call_…]             -> 200  (paired)

Why the OpenAI path doesn't wedge

On the OpenAI path the pipeline holds the finish chunk and only yields the functionCall on the next chunk (pipeline.ts:248-274), so a drop before [DONE] rolls the whole turn back cleanly — no orphan is produced. The wedge this PR targets is specific to the anthropic protocol, where the functionCall is yielded at content_block_stop and the tool is still scheduled after the drop.

Honest scope / caveat

I only exercised the OpenAI path. I did not reproduce the anthropic-protocol wedge — the scenario this PR actually fixes. That conclusion rests on code reading + the PR description + the upstream claude-code#6836 analogy, not a live repro.

Takeaway

The fix addresses a real bug, but it appears anthropic-protocol-specific. OpenAI-compatible providers (incl. dashscope deepseek/qwen) don't hit this wedge at this drop point, so the "applies to openai-compatible providers too" framing may overstate the impact for those users. Since the change touches the shared core stream path for all providers, the unchecked manual weak-network test on the anthropic path is the highest-value item to complete before merge — it's the one scenario that demonstrates the fix's positive value.

中文(完整报告)

验证目标

复现 PR 描述的弱网 wedge:SSE 在 tool_use 的 content_block_stop 后、message_stop 前断流,导致 tool_use/tool_result 配对断裂、会话永久 400 卡死(Ctrl+Y 也救不回)。对比 main 版(含 bug)与本 PR 版(修复)的行为差异。

环境与方法

拓扑:tmux 里的真实 CLI --(http localhost)--> 断流代理 --(https)--> dashscope deepseek-v4-pro(OpenAI 协议)

  • 断流代理:透明转发真实上游流,一次性在第一个达到 finish_reason:"tool_calls" 的流上、转发完 finish + usage chunk 后、[DONE] 之前 socket.destroy(),精确模拟弱网在传输途中断开;后续请求全部透传。
  • 隔离环境:独立 HOME(OpenAI auth + YOLO + deepseek-v4-pro)与独立工作目录,避免污染真实配置、避免 YOLO 误操作。
  • 运行方式:dev 模式直接跑 TS 源码(main 分支 vs PR 分支 worktree)。
  • 测试 prompt:读取一个内含暗号的文件(用于确认工具真的执行)。

断流点精确性(代理 SSE 日志):read_file 的 tool_calls 增量 → finish_reason:"tool_calls" → usage chunk →(吞掉 [DONE] 并断开)

执行与证据

main 版:断流 → API Error: terminated → 提示 Ctrl+Y → 按 Ctrl+Y → ReadFile 执行 → 返回暗号。
PR 版:行为与 main 完全相同。

两版代理请求流逐行一致:

REQUEST #1  system | user | assistant | user                                  -> 200, 注入断流
REQUEST #2  system | user | assistant | user                                  -> 200(Ctrl+Y 重试,messages 干净、无 orphan tool_call)
REQUEST #3  … | assistant[tool_calls:read_file] | tool[tool_result:call_…]    -> 200(配对正确)

结论

在 dashscope deepseek(OpenAI 协议)这个真实场景下,main 与 PR 行为完全一致:断流后 Ctrl+Y 重试都能干净恢复,均不产生 wedge。关键证据:两版重试请求的 messages 都没有悬空的 assistant[tool_calls](orphan)——这正是 wedge 的特征签名,两版都没有。

原因(协议差异):

  • anthropic 协议:functionCall 在 content_block_stop 立即 yield → 断流后 tool 仍被调度 → 提交 tool_result → 请求 2 缺 tool_use → 永久 400 wedge。这是本 PR 真正修复的场景。
  • OpenAI 协议:pipeline 的 hold/merge 机制(pipeline.ts:248-274)让 functionCall 延迟到下一个 chunk 才 yield,断流(无 [DONE])后整个 turn 干净回滚,不产生 orphan。

限定(重要)

只实测了 OpenAI 协议路径anthropic 协议路径的 wedge 未实测——对它「确实会 wedge、且本 PR 能修复」的判断来自代码阅读 + PR 描述 + 上游 claude-code#6836 类比,把握较高但非眼见为实。也未覆盖 Race B(进程崩溃 + --resume)。

总结

本 PR 修的是真 bug,但看起来是 anthropic 协议特有的;OpenAI 兼容后端(含 dashscope deepseek/qwen)在这个断流点不触发该 wedge。鉴于改动压在所有 provider 共享的核心流式路径上,那项未勾选的 anthropic 弱网手动测试是合并前最有价值的一环——它正是唯一能证明本 PR 正面价值的场景。

@pomelo-nwu pomelo-nwu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wenshao wenshao merged commit 6b0e758 into QwenLM:main May 21, 2026
11 of 16 checks passed
xaelistic pushed a commit to xaelistic/qwen-code that referenced this pull request Jun 7, 2026
… for MCP related tools (QwenLM#4176)

Co-authored-by: Jack Wotherspoon <jackwoth@google.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

type/bug Something isn't working as expected

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Close tool_use↔tool_result invariant at failure point (defense in depth)

6 participants